{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 14 - Hypothesis testing for proportions\n", "\n", "This lab is an introduction to hypothesis testing for proportions (or probabilities or percentages), and follows the two scenarios presented in [11.1 Assessing Models](https://www.inferentialthinking.com/chapters/11/1/Assessing_Models.html) in the [Computational and Inferential Thinking](https://www.inferentialthinking.com/chapters/intro) textbook. \n", "\n", "Hypothesis testing is a method for estimating the probability of some data occuring according to a certain scenario (the *null hypothesis*). The null hypothesis is rejected (or the *alternative hypothesis* is accepted) if this probability is small enough.\n", "\n", "First, let's import all necessary libraries." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### U.S. Supreme Court, 1965: Swain vs. Alabama\n", "See [https://www.inferentialthinking.com/chapters/11/1/Assessing_Models.html#us-supreme-court-1965-swain-vs-alabama] for a description of the case.\n", "\n", "We want to simulate the probability that if 100 people are chosen at random from a population that is 26% Black, 8 or fewer of them are Black.\n", "\n", "First we need to define the population (Black and Non-Black) and specify the probability for each group, as in Lab 12. We can also call this population our *model*, because we are assuming it is an explanation for the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " population = [\"Black\",\"Non-Black\"]\n", "pop_prob = [0.26,1-0.26]\n", "
\n", "\n", "Next, simulate just one sample of 100 people randomly chosen from the population, and count the number of Black people in the sample.\n", "\n", "The steps are:\n", "- take a sample size 100 from the population\n", "- convert the sample into a Pandas series and compute the value counts\n", "- from the value counts, find the number of Black people in the sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " sample = np.random.choice(population, p= pop_prob, size = 100)\n", "sample_counts = pd.Series(sample).value_counts()\n", "sample_counts[\"Black\"]\n", "
\n", "\n", "Now we want to repeat this 10,000 times to take 10,000 different samples and save the number of Black people in each sample in a list. Can you set up the loop to do this? \n", "\n", "Hint: Use a smaller number than 10,000 while you are testing your loop. Once it's working, change it back to 10,000." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " counts = []\n", "for i in range(10000):\n", " sample = np.random.choice(population, p= pop_prob, size = 100)\n", " sample_counts = pd.Series(sample).value_counts()\n", " counts.append(sample_counts[\"Black\"])\n", "
\n", "\n", "Finally, let's plot a histogram of the counts of Black people in the 10,000 samples. Remember you will have to convert your list into a Pandas Series first." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " pd.Series(counts).hist()\n", "
\n", "\n", "This histogram gives us a good estimate of the distribution of the number of Black men on a jury panel that has been randomly selected from the population. What's the probability that there are 8 or fewer Black men on the jury panel? We can't figure out the probabilty as an exact number, but is it big? Small? Very small?\n", "\n", "The probabily that only 8 Black men would be selected for the jury panel if it was chosen completely at random is very, very small. It could have happened, but if there is another explanation (ex. jury panel was manipulated due to racism) with a higher probability then that explanation is more likely. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mendel’s Pea Flowers\n", "\n", "Gregor Mendel is the father of modern genetics and conducted many breeding experiments on pea plants (he grew 28,000 pea plants over his lifetime!). The textbook explanation is [here](https://www.inferentialthinking.com/chapters/11/1/Assessing_Models.html#mendels-pea-flowers), but there is a more detailed explanation [here](https://www.khanacademy.org/science/high-school-biology/hs-classical-genetics/hs-introduction-to-heredity/a/mendel-and-his-peas).\n", "\n", "Mendel's model is that pea plants will have purple flowers 75% of the time and white flowers 25% of the time. In one test, he grew 929 pea plants and 705 had purple flowers. Let's conduct a simulation to test if this data has a high probability of occuring under this model. That is, let's test that the *data is consistent with the model*.\n", "\n", "First set up the population/model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, simulate one sample from the population and count the number of purple flowers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then repeat your simulation 10,000 times." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, plot a histogram of the counts of purple flowers in the simulations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the probabilty (high, low, etc.) that 705 flowers were purple? Do you think Mendel's data is consistent with his model?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }